C. Seth Lester, ASA, MAAA 2024-11-01
This project was inspired by insights shared in a recent LinkedIn post where I explored whether the CDC’s Social Vulnerability Index (SVI) could accurately predict county-level voter turnout in US Presidential elections.
Public presidential polls often use likely voter models to re-weight raw polling samples, incorporating demographic factors such as race/ethnicity, poverty level, and education level, which overlap significantly with components of the SVI. In healthcare analytics, the SVI is commonly used to account and control for geographic variations in social determinants of health that can influence or confound causal relationships between healthcare interventions and consequent cost/utilization patterns for a population.
Motivated by Yubin Park’s concept of blending unrelated datasets to create “scientific curries,” I set out to investigate how social vulnerabilities might impact civic engagement - particularly through the lens of examining the relationship between social vulnerability and voter turnout at the county level of granularity.
The analysis relies on three primary data sources:
This data is accessed using a US Census Bureau
API key and the tidycensus R package. It
provides demographic and population estimates for counties, which are
integral in this analysis for determining county-level estimates of the
Voting Age Population (VAP) and Voting Eligible Population (VEP). The
work of staging this data is done in the R script
src/get_vep_totals.R.
The VAP is calculated by determining the number of individuals age 18 and over in a particular county using 5-year ACS data. VEP is calculated by subtracting the count of non-citizen individuals aged 18 and over from VAP. While these counts might slightly overestimate true VAP/VEP due to the inclusion of certain ineligible groups (e.g., felons in some states), they provide a consistent and reliable set of denominators for county-level turnout analysis.
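The arithmetic behind these denominators is simple enough to sketch (a Python illustration with hypothetical counts; the project derives the real figures from 5-year ACS tables in R):

```python
# Illustrative sketch of the VAP/VEP definitions above (hypothetical counts;
# the project derives the real figures from 5-year ACS tables in R).
def voting_eligible_population(pop_18_plus: int, noncitizens_18_plus: int) -> int:
    """VEP = VAP minus non-citizen adults; still a slight overestimate
    (e.g., it includes felons in states where they cannot vote)."""
    return pop_18_plus - noncitizens_18_plus

vap = 50_000                                   # VAP: all residents aged 18+
vep = voting_eligible_population(vap, 2_500)   # VEP: 47,500
```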
As there is generally a 1-2 year lag between the ACS measures for a time period and the time these measures are compiled and released by the US Census Bureau, this project uses VAP/VEP totals at the county level based on the ACS data for the five year period ending in YYYY - 2 for any presidential election year in YYYY. For example, the VEP/VAP measures used to calculate turnout rates at the county level for the 2016 election are determined using 5-year ACS measures from 2010 - 2014.
Ultimately, our goal is to devise a prediction model for turnout using ACS / SVI measures that tend to have a 1-2 year lag in availability. If we want to build a turnout model to predict 2024 election turnout, we will need to use 2018 - 2022 5-year ACS measures, as that will be the latest data available for constructing SVI measures.
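The lag convention above can be expressed as a small helper (a Python sketch; the function name is my own):

```python
def acs_vintage_for_election(election_year: int) -> tuple:
    """Return the (start, end) years of the 5-year ACS period used for an
    election in `election_year`: the window ending two years prior."""
    end = election_year - 2
    return (end - 4, end)

assert acs_vintage_for_election(2016) == (2010, 2014)
assert acs_vintage_for_election(2024) == (2018, 2022)
```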
MIT Election Lab data
contains historical county-level election returns, including data from
2012, 2016, and 2020. This data allows for a comprehensive analysis of
turnout trends over multiple election cycles. This data is staged in the
R script src/get_election_data.R.
This data is relatively straightforward, with one record per county that is joinable to the VEP/VAP data gathered from ACS and the SVI factors gathered from the CDC. One issue in this data is the tabulation of county votes in Alaska. Alaska is unusual in that the state is not entirely subdivided into county equivalents (boroughs) - some people live outside of any organized borough - so for the sake of analyzing the relationship between SVI and turnout, I thought it best to remove the Alaska vote data. We’re removing a very small piece of the sample. Sorry, Alaska!
Caveat: don’t use this turnout model to predict Alaska results!
The SVI is produced every 2-4 years by the CDC based on measures contained in the 5-year ACS data. This freely-downloadable dataset contains an overall SVI score for counties / census tracts in the US. The overall SVI score is distilled from four component scores that measure social vulnerability across four categories: socioeconomic status, household composition, racial/ethnic minority status, and housing type/transportation.
These scores (and their component pieces) are used subsequently in
this analysis to quantify the impact of social factors on turnout rates in
Presidential elections. The SVI data required for this analysis is
loaded and staged in the R script src/get_svi.R.
The process used by CDC/ATSDR to calculate SVI from the underlying ACS 5-year measures underwent a large number of changes in 2020. In order to evaluate the concern that SVI (or its components) is not sufficiently stable over time, I visually evaluated the SVI (and its four underlying component measures) over time for the 25 largest (by population) US counties to check that the SVI redesign in 2020 did not lead to substantial volatility in the measure (or, at least, no more so than the actual year 2020 would have added to the measure).
To better understand variation over time in the SVI measures, I examined the history of variation in the overall SVI measure for the top 25 counties in the US (ranked by population size).
First, we start with overall SVI, which ranks each county in the US by percentile of overall social vulnerability based on the component sum of the four social vulnerability themes:
The overall SVI for each county is determined by summing the percentile rankings of the four themes for that county and then percentile-ranking that sum across all counties. It’s important to note that no one factor is given larger “weight” than another in this calculation, which makes the computation of SVI quite simple - but also might leave something to be desired in terms of accurately measuring social vulnerability.
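The sum-then-rank scheme can be sketched as follows (a toy Python illustration with three hypothetical counties; the CDC documents its percentile rank as (rank − 1) / (n − 1)):

```python
def percent_rank(values):
    # Percentile rank on a 0-1 scale: share of values strictly below each
    # value, i.e. (rank - 1) / (n - 1) when there are no ties.
    n = len(values)
    return [sum(v < x for v in values) / (n - 1) for x in values]

# Four theme percentile ranks for three hypothetical counties
themes = {"A": [0.9, 0.8, 0.7, 0.6],
          "B": [0.2, 0.3, 0.1, 0.4],
          "C": [0.5, 0.5, 0.5, 0.5]}

sums = {c: sum(t) for c, t in themes.items()}  # unweighted: no theme dominates
counties = list(sums)
overall_svi = dict(zip(counties, percent_rank([sums[c] for c in counties])))
# overall_svi == {"A": 1.0, "B": 0.0, "C": 0.5}
```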
Next, I examined the four component SVI themes over time, separately.
With the possible exception of the third component of SVI (the Racial & Ethnic Minority Status component), the four SVI components appear to be relatively stable over time when considering that in 2020 we could expect a considerable amount of variation in how SVI was measured due to both pandemic-related factors as well as underlying bias in ACS measurements sampled in 2020. Another possible explanation for the instability of the third component could be that in 2020 this component was heavily redefined.
When considering our intended task of building a turnout model with the SVI, we might consider using the underlying variables for this category instead of the component percentile rank variables, as the underlying variables are more stable over time.
The goal of this project is to eventually use election returns turnout data for 2012, 2016, and 2020 to develop a turnout model that uses SVI data from the prior 2 years to predict turnout at the county level.
Prior to beginning this modeling exercise I wanted to better understand the distribution of our response - presidential election turnout rates at the county level.
First we want to understand the distribution of turnout using both of our denominators (VAP and VEP). Note that sometimes (in 11 cases) the total number of votes received in a county will equal or exceed estimates of VAP. This is typically due to counties that are VERY small in population, where the ACS estimate of population is likely to be an undercount.
The 11 non-Alaska counties where this occurs are very sparsely-populated rural counties with populations in the hundreds (at most, 2,416 residents), and were won by both Democratic and Republican candidates. There is NOT sufficient precision in ACS population estimates to prove anything about unauthorized people voting, so put those silly tinfoil hats away, please.
First, we want to join our election data (from one source) to our ACS 5-year population estimates for VEP and VAP (from another source). We will join on FIPS code and then calculate turnout rates using both VAP and VEP as denominators. Then we will analyze the distribution of both turnout measures and remove any outliers (perhaps due to very high turnout in very low-population areas).
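In miniature, the join-and-rate step looks like this (a Python sketch with made-up numbers and column names; the real work happens in the R staging scripts):

```python
# Inner join on (FIPS, year), then turnout under each denominator.
votes = {("37001", 2016): 63_000, ("37003", 2016): 16_000}
pops  = {("37001", 2016): {"vap": 120_000, "vep": 113_000},
         ("37003", 2016): {"vap": 29_000,  "vep": 28_000}}

turnout = {key: {"turnout_vap": n / pops[key]["vap"],
                 "turnout_vep": n / pops[key]["vep"]}
           for key, n in votes.items() if key in pops}
```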
We define an outlier for both VAP- and VEP-based turnout rates as any county, for any election year, with a turnout rate that exceeds 3 standard deviations above the mean. This process removes the 11 counties mentioned above with turnout rates greater than 100%, plus an additional 21 (extremely low population) counties with turnout rates that are likely overestimated due to an undercounted denominator. With over 3,000 counties in our sample for each election year, we’re losing very little sample space by doing this!
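A minimal sketch of that filter (Python; the rates below are toy values, not from the data):

```python
from statistics import mean, stdev

def drop_high_outliers(rates):
    """Keep only rates at or below mean + 3 standard deviations."""
    cutoff = mean(rates) + 3 * stdev(rates)
    return [r for r in rates if r <= cutoff]

# 20 plausible county rates plus one tiny county with an undercounted
# denominator pushing its apparent turnout above 100%
rates = [0.55] * 10 + [0.65] * 10 + [1.40]
filtered = drop_high_outliers(rates)   # the 1.40 rate is removed
```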
Since I intend to use data from 2012, 2016, and 2020 presidential elections to build the turnout model, I thought it best to next examine election turnout rates over time for the top 25 counties, as I did earlier with SVI and its four component themes.
Finally, having combined turnout rates and corresponding 2-year-lagged SVI data for the three presidential election years, I plotted the relationship between overall SVI at the county level and county-level turnout rates.
The image above suggests a relatively moderate negative Pearson correlation (R) between SVI and turnout rates. I thought this was pretty darn interesting, so I posted a version of this chart on LinkedIn.
I then examined this relationship further by breaking out the correlation with turnout rates for each of the 4 primary SVI component measures, where I also found remarkable stability within each SVI component measure over the three presidential elections.
Just from a cursory scan of Pearson correlation broken out by the four major SVI components, it looks like the real breadwinner variables for a potential turnout model will come from the Socioeconomic Status and Housing Type/Transportation categories, with perhaps some useful information in the Household Characteristics category. It seems unintuitive that racial/ethnic identity would have a meaningful relationship with turnout rates for a county, but we’ll also consider component features from this SVI category as we build our prediction model.
OK, enough description of the data - let’s use SVI to predict the future!
We want a model that will predict turnout for election year YYYY based on the SVI file for YYYY - 2. Also, it is my hope that our model will predict not absolute turnout (that is, number of votes in each county), but rather, a US county’s relative turnout expressed as a multiplicative factor of the state-level turnout.
Since different states will have different degrees of turnout relative to other factors (e.g., ad spending, swing state status, etc.), the model is intended to be used as a means to predict turnout relativities between counties in a particular state. I will demonstrate usage of the model in a final section to predict turnout in my home state of North Carolina in the 2024 election.
(Also, if you’re following along at home, I designed this model like a risk adjustment model.)
In short, we will train our model on a response that is each county’s turnout rate divided by the average turnout rate across all counties for that election year. Then, when our model predicts a turnout score of 1.00 +/- x, we will interpret that to mean that the county is expected to have a turnout equal to 1.00 +/- x times the average turnout across all counties for that election (represented by 1.00).
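Concretely, the response construction looks like this (a Python sketch; the turnout numbers are invented):

```python
def scale_by_year_mean(turnout_by_county):
    """Divide each county's turnout by the mean across counties for that year."""
    m = sum(turnout_by_county.values()) / len(turnout_by_county)
    return {c: t / m for c, t in turnout_by_county.items()}

scaled = scale_by_year_mean({"A": 0.60, "B": 0.50, "C": 0.70})
# County A turns out exactly at the all-county average (score 1.00);
# B is ~17% below average and C is ~17% above.
```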
Furthermore, 2020 presented some unique challenges (and unique motivations) that might not reflect the more stable, general relationship between SVI variables and voter turnout.
Before writing all this up, I examined the performance between two models - one trained including 2020 data, and one trained without 2020 data - and found no material difference in predictive performance.
Therefore, 2020 data is included.
Ideally we would want a model that is trained on pre-2024 returns data but is predictive of turnout outcomes in 2024. The variables used to measure overall SVI across all four categories that are present in all versions of the SVI data available (2010 - 2022) are as follows:
| SVI Feature | Description |
|---|---|
| EPL_POV* | Percentile percentage of persons below 100% (2010 - 2018) / 150% (2020+) poverty estimate |
| EPL_UNEMP | Percentile percentage of civilian (age 16+) unemployed estimate |
| EPL_NOHSDP | Percentile percentage of persons with no high school diploma (age 25+) estimate |
| EPL_AGE65 | Percentile percentage of persons aged 65 and older estimate |
| EPL_AGE17 | Percentile percentage of persons aged 17 and younger estimate |
| EPL_SNGPNT | Percentile percentage of single-parent households with children under 18 estimate |
| EPL_LIMENG | Percentile percentage of persons (age 5+) who speak English “less than well” |
| EPL_MINRTY | Percentile percentage of persons who identify as a racial/ethnic identity in the minority |
| EPL_MUNIT | Percentile percentage housing in structures with 10 or more units |
| EPL_MOBILE | Percentile percentage mobile homes |
| EPL_CROWD | Percentile percentage households with more than 1 person per room |
| EPL_NOVEH | Percentile percentage households with no vehicle available |
| EPL_GROUPQ | Percentile percentage of persons in group quarters estimate |
Often a very good first step in building a predictive model is to get a handle on your feature space - including understanding each feature’s distribution. Since the EPL_* variables are percentile ranks, we can expect that these variables all follow a uniform distribution on a support of 0 to 100%. Let’s confirm that now:
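One quick way to check this (a Python sketch on a 0-1 scale; the analysis itself does this graphically in R) is to compare empirical quantiles against their theoretical Uniform values:

```python
def looks_uniform(xs, tol=0.05):
    """Crude uniformity check on a 0-1 scale: empirical quantiles should
    sit close to their theoretical values under Uniform(0, 1)."""
    xs = sorted(xs)
    n = len(xs)
    probes = (0.1, 0.25, 0.5, 0.75, 0.9)
    return all(abs(xs[int(q * (n - 1))] - q) < tol for q in probes)

ranks = [i / 3099 for i in range(3100)]          # a perfect percentile ranking
skewed = [(i / 3099) ** 3 for i in range(3100)]  # a raw, right-skewed proportion
# looks_uniform(ranks) is True; looks_uniform(skewed) is False
```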
We have now confirmed that every variable in the feature space reflects a Uniform(0,100) distribution. Percentile ranks alone might not give us a feature space that sets us up with a high-performing predictive model, so let’s investigate the non-percentile-ranked variables (i.e., estimates of proportions) underlying these within the loaded SVI datasets.
Each of these percentile-ranked measurements is based on a US-wide percentile ranking of an underlying proportion measure drawn directly from ACS 5-year data for the applicable time period. Each of these proportion measures is also present in the SVI datasets.
| SVI Feature | Description |
|---|---|
| EP_POV* | Percentage of persons below 100% (2010 - 2018) / 150% (2020+) poverty estimate |
| EP_UNEMP | Percentage of civilian (age 16+) unemployed estimate |
| EP_NOHSDP | Percentage of persons with no high school diploma (age 25+) estimate |
| EP_AGE65 | Percentage of persons aged 65 and older estimate |
| EP_AGE17 | Percentage of persons aged 17 and younger estimate |
| EP_SNGPNT | Percentage of single-parent households with children under 18 estimate |
| EP_LIMENG | Percentage of persons (age 5+) who speak English “less than well” |
| EP_MINRTY | Percentage of persons who identify as a racial/ethnic identity in the minority |
| EP_MUNIT | Percentage housing in structures with 10 or more units |
| EP_MOBILE | Percentage mobile homes |
| EP_CROWD | Percentage households with more than 1 person per room |
| EP_NOVEH | Percentage households with no vehicle available |
| EP_GROUPQ | Percentage of persons in group quarters estimate |
Now we will build some candidate models for our final predictive
model for relative county-level turnout. This gives us an overview of
how this feature set’s relationship with the target response
(scaled_turnout) is described by the underlying data.
##
## Call:
## lm(formula = formula_ep, data = all_years)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.47285 -0.06017 -0.00932 0.05096 0.62296
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.216e+00 1.852e-02 65.665 < 2e-16 ***
## EP_POV -4.063e-03 2.664e-04 -15.249 < 2e-16 ***
## EP_UNEMP -1.242e-03 3.977e-04 -3.123 0.00180 **
## EP_NOHSDP -3.424e-03 2.673e-04 -12.808 < 2e-16 ***
## EP_AGE65 5.056e-04 3.741e-04 1.352 0.17654
## EP_AGE17 -1.460e-03 5.338e-04 -2.734 0.00627 **
## EP_SNGPNT -7.193e-03 5.330e-04 -13.493 < 2e-16 ***
## EP_LIMENG -2.637e-03 5.795e-04 -4.550 5.43e-06 ***
## EP_MINRTY 2.204e-03 8.447e-05 26.097 < 2e-16 ***
## EP_MUNIT -5.470e-04 2.972e-04 -1.840 0.06578 .
## EP_MOBILE 2.681e-04 1.550e-04 1.730 0.08370 .
## EP_CROWD -6.706e-03 8.149e-04 -8.229 < 2e-16 ***
## EP_NOVEH -3.057e-03 3.928e-04 -7.781 7.96e-15 ***
## EP_GROUPQ -1.165e-02 2.971e-04 -39.218 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09746 on 9277 degrees of freedom
## Multiple R-squared: 0.3978, Adjusted R-squared: 0.3969
## F-statistic: 471.3 on 13 and 9277 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = formula_epl, data = all_years)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46972 -0.05836 -0.00347 0.05483 0.64986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.153374 0.007673 150.319 < 2e-16 ***
## EPL_POV -0.074055 0.006143 -12.055 < 2e-16 ***
## EPL_UNEMP -0.017210 0.004770 -3.608 0.000311 ***
## EPL_NOHSDP -0.141038 0.006081 -23.193 < 2e-16 ***
## EPL_AGE65 0.037088 0.005111 7.257 4.28e-13 ***
## EPL_AGE17 0.015648 0.005177 3.022 0.002514 **
## EPL_SNGPNT -0.058820 0.005571 -10.559 < 2e-16 ***
## EPL_LIMENG -0.033557 0.004870 -6.890 5.95e-12 ***
## EPL_MINRTY 0.132302 0.005455 24.253 < 2e-16 ***
## EPL_MUNIT -0.047356 0.005250 -9.021 < 2e-16 ***
## EPL_MOBILE -0.006207 0.005429 -1.143 0.252902
## EPL_CROWD -0.009761 0.004717 -2.069 0.038554 *
## EPL_NOVEH -0.026778 0.005033 -5.320 1.06e-07 ***
## EPL_GROUPQ -0.137538 0.004014 -34.263 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09689 on 9277 degrees of freedom
## Multiple R-squared: 0.4047, Adjusted R-squared: 0.4039
## F-statistic: 485.2 on 13 and 9277 DF, p-value: < 2.2e-16
Out of the gate, our basic vanilla linear models explain the data quite well, with an adjusted R-squared hovering around 40% and with a feature space filled with highly-significant variables, for both the model based on EP_ (proportion estimates) variables as well as the model based on EPL_ (percentile ranked) variables in the SVI. So, that’s great news!
Given that we are getting similar predictive performance from both the EP_ and EPL_ series variables, I’ve chosen to proceed with the EP_ variables because the coefficients for our model will have a more intuitive, commonsense interpretation than with the EPL_ variables. The coefficients corresponding to EP_ variables allow us to make statements like “For every 1% increase in the estimated proportion of X, we can expect a Y% increase/decrease in turnout for that county.” You know, for the sake of model explainability and all!
We might imagine there is substantial multicollinearity in the feature space, so we should be aware of any strong correlations between our 13 features going into the modeling project. As a matter of good model design practice, let’s take a peek at a correlation matrix for our SVI features. We’ll see nothing here that doesn’t make a lot of sense.
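Pearson correlation itself is easy to sketch (plain Python; the toy county-level age shares below are invented to mimic the opposite-moving EP_AGE17 / EP_AGE65 pattern):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy county-level shares: where children are a bigger slice of the
# population, seniors tend to be a smaller one (and vice versa).
ep_age17 = [28.0, 25.0, 22.0, 20.0, 18.0]
ep_age65 = [12.0, 14.0, 17.0, 19.0, 22.0]
r = pearson(ep_age17, ep_age65)   # strongly negative in this toy data
```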
Here, we see some obvious correlation between several of these features. One obvious example to call out is the moderate negative correlation between EP_AGE17 and EP_AGE65. Our hope is that we can rely on regularization to mute (or deselect entirely) some of the features where multicollinearity would otherwise make our model’s coefficient estimates unstable.
Using R’s glmnet package, we can use regularization to
arrive at a more parsimonious (i.e., potentially fewer features) model
with at least similar predictive performance to the basic OLS models we
fitted to SVI variables earlier. This approach will also give
us the ability to consider any number of interactions between these
features, where interactions with a less-than-impactful contribution to
predicting the response will be “penalized” out through the
regularization process.
As mentioned above, this approach will also have the benefit of
hopefully removing (via the regularization penalty) some of the SVI
features where there is excessive multicollinearity (if any). Using
glmnet, I’ve fixed the alpha parameter to 1, forcing a
lasso regularization regime with an L1 penalty, which is known to be
better at “culling” the feature space to arrive at a more parsimonious
model with fewer features.
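The mechanism by which the lasso “culls” features is soft-thresholding of coefficients under the L1 penalty; in the orthonormal-design case it reduces to the sketch below (a toy Python illustration of the mechanism, not the glmnet fit itself; the coefficient values are invented):

```python
def soft_threshold(beta_ols, lam):
    """Lasso's effect on a coefficient (orthonormal-design case): shrink
    toward zero by lam, and set to exactly zero if within lam of zero."""
    if beta_ols > lam:
        return beta_ols - lam
    if beta_ols < -lam:
        return beta_ols + lam
    return 0.0   # feature is penalized out of the model entirely

coefs = {"EP_GROUPQ": -0.60, "EP_MINRTY": 0.45, "EP_MOBILE": 0.02}
lasso_coefs = {k: soft_threshold(b, lam=0.05) for k, b in coefs.items()}
# EP_MOBILE drops out; the other coefficients shrink toward zero.
```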
Before fitting the regularized model, several refinements to the feature set are warranted:
Removing EP_POV: The poverty variable (EP_POV) changed its definition from 100% of the federal poverty level (in SVI 2010-2018) to 150% of the poverty level (in SVI 2020+). This introduces a structural break in the feature across the training data, making it unreliable as a predictor. It is excluded from the final model.
Log-transforming skewed variables: Several EP_
variables (EP_LIMENG, EP_CROWD, EP_MUNIT) exhibit strong right-skew,
with many counties clustered near zero. Applying a
log(1 + x) transformation improves linearity with the
response.
Adding log(population): County total population
(E_TOTPOP, already present in the SVI data) serves as a
proxy for county scale and urbanicity. We include
log(E_TOTPOP) to capture the well-known relationship
between county size and turnout patterns.
Adding election year: Including the election year as a numeric feature allows the model to capture secular shifts in the relationship between SVI variables and turnout over time. Lasso regularization will zero this out if it is not informative.
Weighting by county size: Counties vary enormously
in population, and ACS estimates for small counties have substantially
higher standard errors. Weighting observations by log(VEP)
in the loss function allows the model to focus on counties where the
signal-to-noise ratio is highest.
Leave-one-year-out cross-validation: Rather than random 10-fold CV (which mixes counties from different election years in each fold), we use leave-one-year-out CV. This trains on two election cycles and tests on the third, directly measuring whether the model generalizes across elections — the actual use case.
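The splitting scheme itself is simple (a Python sketch; `rows` and its layout are my own invention, not the project’s data structure):

```python
def leave_one_year_out(rows, years=(2012, 2016, 2020)):
    """Yield (held_out_year, train_rows, test_rows): train on two
    election cycles, test on the third."""
    for held_out in years:
        train = [r for r in rows if r[0] != held_out]
        test = [r for r in rows if r[0] == held_out]
        yield held_out, train, test

rows = [(2012, "countyA"), (2016, "countyA"), (2020, "countyA"),
        (2012, "countyB"), (2016, "countyB"), (2020, "countyB")]
splits = list(leave_one_year_out(rows))   # three folds, one per election year
```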
Using the refined feature matrix (with all pairwise interactions),
the chart below shows the coefficient values for the primary variables
and interactions that emerged from the glmnet process
(features with an absolute coefficient value below .0005, as well as
the intercept, are excluded).
## [1] "Leave-one-year-out CV R-squared (on scaled turnout): 0.4973"
Now that we have a trained model, let’s see how it performs on completely unseen data: the 2024 presidential election. The model was trained exclusively on 2012, 2016, and 2020 data, so 2024 represents a true out-of-sample validation.
First, let’s zoom in on my home state of North Carolina, where the model performs quite well on the 2024 data.
| LOCATION | Predicted_Votes | Actual_Votes | Absolute_Error_Pct |
|---|---|---|---|
| Alamance County, North Carolina | 94521 | 89831 | 5.22% |
| Alexander County, North Carolina | 21140 | 20677 | 2.24% |
| Alleghany County, North Carolina | 6597 | 6496 | 1.55% |
| Anson County, North Carolina | 11691 | 10875 | 7.50% |
| Ashe County, North Carolina | 16098 | 16253 | 0.95% |
| Avery County, North Carolina | 9141 | 9489 | 3.67% |
| Beaufort County, North Carolina | 25899 | 26572 | 2.53% |
| Bertie County, North Carolina | 10132 | 9186 | 10.30% |
| Bladen County, North Carolina | 17273 | 16764 | 3.04% |
| Brunswick County, North Carolina | 97205 | 109378 | 11.13% |
| Buncombe County, North Carolina | 166496 | 160510 | 3.73% |
| Burke County, North Carolina | 49063 | 45847 | 7.01% |
| Cabarrus County, North Carolina | 127578 | 120202 | 6.14% |
| Caldwell County, North Carolina | 43357 | 43540 | 0.42% |
| Camden County, North Carolina | 6225 | 6304 | 1.25% |
| Carteret County, North Carolina | 43071 | 45817 | 5.99% |
| Caswell County, North Carolina | 13078 | 12040 | 8.62% |
| Catawba County, North Carolina | 88949 | 87109 | 2.11% |
| Chatham County, North Carolina | 47385 | 52301 | 9.40% |
| Cherokee County, North Carolina | 18831 | 17824 | 5.65% |
| Chowan County, North Carolina | 7877 | 7552 | 4.30% |
| Clay County, North Carolina | 6921 | 7728 | 10.44% |
| Cleveland County, North Carolina | 54445 | 51706 | 5.30% |
| Columbus County, North Carolina | 27406 | 26402 | 3.80% |
| Craven County, North Carolina | 56187 | 56173 | 0.02% |
| Cumberland County, North Carolina | 170093 | 140513 | 21.05% |
| Currituck County, North Carolina | 18231 | 18053 | 0.99% |
| Dare County, North Carolina | 25162 | 25196 | 0.13% |
| Davidson County, North Carolina | 96294 | 93452 | 3.04% |
| Davie County, North Carolina | 25517 | 26850 | 4.96% |
| Duplin County, North Carolina | 23038 | 22898 | 0.61% |
| Durham County, North Carolina | 176527 | 180912 | 2.42% |
| Edgecombe County, North Carolina | 26723 | 24448 | 9.31% |
| Forsyth County, North Carolina | 205895 | 204726 | 0.57% |
| Franklin County, North Carolina | 39148 | 42667 | 8.25% |
| Gaston County, North Carolina | 129732 | 119256 | 8.78% |
| Gates County, North Carolina | 6565 | 5868 | 11.88% |
| Graham County, North Carolina | 4478 | 4779 | 6.30% |
| Granville County, North Carolina | 33229 | 32104 | 3.50% |
| Greene County, North Carolina | 8815 | 8450 | 4.32% |
| Guilford County, North Carolina | 291496 | 285053 | 2.26% |
| Halifax County, North Carolina | 26486 | 23965 | 10.52% |
| Harnett County, North Carolina | 73845 | 63757 | 15.82% |
| Haywood County, North Carolina | 38668 | 37851 | 2.16% |
| Henderson County, North Carolina | 73629 | 69974 | 5.22% |
| Hertford County, North Carolina | 11658 | 9843 | 18.44% |
| Hoke County, North Carolina | 27570 | 22767 | 21.10% |
| Hyde County, North Carolina | 2544 | 2421 | 5.08% |
| Iredell County, North Carolina | 110927 | 110875 | 0.05% |
| Jackson County, North Carolina | 23781 | 21942 | 8.38% |
| Johnston County, North Carolina | 123333 | 124678 | 1.08% |
| Jones County, North Carolina | 5590 | 5463 | 2.32% |
| Lee County, North Carolina | 32258 | 30081 | 7.24% |
| Lenoir County, North Carolina | 27960 | 27503 | 1.66% |
| Lincoln County, North Carolina | 53579 | 55582 | 3.60% |
| McDowell County, North Carolina | 24213 | 23655 | 2.36% |
| Macon County, North Carolina | 22763 | 21934 | 3.78% |
| Madison County, North Carolina | 12846 | 13621 | 5.69% |
| Martin County, North Carolina | 12445 | 12040 | 3.36% |
| Mecklenburg County, North Carolina | 631990 | 577505 | 9.43% |
| Mitchell County, North Carolina | 8485 | 8842 | 4.04% |
| Montgomery County, North Carolina | 13924 | 13206 | 5.44% |
| Moore County, North Carolina | 63371 | 61790 | 2.56% |
| Nash County, North Carolina | 51757 | 52471 | 1.36% |
| New Hanover County, North Carolina | 142128 | 138734 | 2.45% |
| Northampton County, North Carolina | 10690 | 9215 | 16.01% |
| Onslow County, North Carolina | 99666 | 81681 | 22.02% |
| Orange County, North Carolina | 80800 | 87807 | 7.98% |
| Pamlico County, North Carolina | 7927 | 7976 | 0.61% |
| Pasquotank County, North Carolina | 22594 | 20343 | 11.07% |
| Pender County, North Carolina | 35898 | 38909 | 7.74% |
| Perquimans County, North Carolina | 8082 | 7666 | 5.43% |
| Person County, North Carolina | 23667 | 22036 | 7.40% |
| Pitt County, North Carolina | 90605 | 87130 | 3.99% |
| Polk County, North Carolina | 13153 | 13068 | 0.65% |
| Randolph County, North Carolina | 78172 | 76008 | 2.85% |
| Richmond County, North Carolina | 21218 | 19873 | 6.77% |
| Robeson County, North Carolina | 55314 | 46770 | 18.27% |
| Rockingham County, North Carolina | 49021 | 49595 | 1.16% |
| Rowan County, North Carolina | 78760 | 75394 | 4.46% |
| Rutherford County, North Carolina | 36299 | 34670 | 4.70% |
| Sampson County, North Carolina | 28371 | 28201 | 0.60% |
| Scotland County, North Carolina | 16249 | 14626 | 11.10% |
| Stanly County, North Carolina | 35295 | 36714 | 3.87% |
| Stokes County, North Carolina | 27084 | 27175 | 0.33% |
| Surry County, North Carolina | 37355 | 37508 | 0.41% |
| Swain County, North Carolina | 7621 | 7052 | 8.07% |
| Transylvania County, North Carolina | 21513 | 20780 | 3.53% |
| Tyrrell County, North Carolina | 1757 | 1757 | 0.00% |
| Union County, North Carolina | 141276 | 139355 | 1.38% |
| Vance County, North Carolina | 22420 | 20092 | 11.59% |
| Wake County, North Carolina | 680297 | 653580 | 4.09% |
| Warren County, North Carolina | 10849 | 10013 | 8.35% |
| Washington County, North Carolina | 6365 | 5944 | 7.08% |
| Watauga County, North Carolina | 30977 | 33095 | 6.40% |
| Wayne County, North Carolina | 59554 | 54762 | 8.75% |
| Wilkes County, North Carolina | 34834 | 36320 | 4.09% |
| Wilson County, North Carolina | 40865 | 40045 | 2.05% |
| Yadkin County, North Carolina | 21034 | 20397 | 3.12% |
| Yancey County, North Carolina | 11080 | 11283 | 1.80% |
## # A tibble: 1 × 3
## Predicted_Votes Actual_Votes MAPE
## <int> <dbl> <chr>
## 1 5909921 5699141 5.62%
## [1] "NC R-squared (scaled turnout): 0.6493"
Having demonstrated the model’s performance on North Carolina, let’s now evaluate how the model performs across all US states (excluding Alaska) in the 2024 presidential election.
## [1] "National R-squared (scaled turnout): 0.4679"
## [1] "National MAE: 3438 votes"
## [1] "National MAPE: 7.26%"
## [1] "Counties evaluated: 3072"
| State | N_Counties | Total_Actual | Total_Predicted | Total_Error_Pct | MAE | MAPE | R_Squared |
|---|---|---|---|---|---|---|---|
| CA | 58 | 15862678 | 16199887 | 2.13% | 12047 | 6.60% | 0.5460 |
| TX | 250 | 11387360 | 11596840 | 1.84% | 3376 | 10.87% | 0.4900 |
| FL | 67 | 10893752 | 11720480 | 7.59% | 15003 | 7.40% | 0.5527 |
| NY | 62 | 8262495 | 7763967 | 6.03% | 9881 | 8.69% | 0.6933 |
| PA | 67 | 7034206 | 7182482 | 2.11% | 3486 | 4.78% | 0.7018 |
| OH | 88 | 5767788 | 5945850 | 3.09% | 3486 | 4.06% | 0.8056 |
| NC | 100 | 5699141 | 5909968 | 3.70% | 3167 | 5.62% | 0.6493 |
| MI | 83 | 5664186 | 5959316 | 5.21% | 4105 | 4.58% | 0.6153 |
| IL | 102 | 5633310 | 5922508 | 5.13% | 3815 | 4.76% | 0.6521 |
| GA | 158 | 5248353 | 5458558 | 4.01% | 2447 | 9.06% | 0.5844 |
| VA | 133 | 4482576 | 4755598 | 6.09% | 2803 | 6.83% | 0.7335 |
| NJ | 21 | 4272725 | 4560349 | 6.73% | 14003 | 5.92% | 0.8429 |
| WA | 39 | 3924243 | 4291091 | 9.35% | 10217 | 6.59% | 0.5989 |
| MA | 14 | 3473668 | 3666895 | 5.56% | 15287 | 5.55% | 0.6886 |
| WI | 72 | 3422918 | 3650889 | 6.66% | 3309 | 7.48% | 0.7821 |
| AZ | 15 | 3412953 | 3646607 | 6.85% | 18729 | 8.67% | 0.1798 |
| MN | 87 | 3253920 | 3488831 | 7.22% | 2882 | 6.96% | 0.4305 |
| CO | 61 | 3169115 | 3479863 | 9.81% | 5330 | 9.07% | 0.5793 |
| TN | 95 | 3063942 | 3168839 | 3.42% | 2681 | 5.90% | 0.5188 |
| MD | 24 | 3038334 | 3231753 | 6.37% | 8770 | 4.97% | 0.7843 |
| IN | 92 | 2936677 | 2999841 | 2.15% | 1987 | 4.48% | 0.7215 |
| MO | 115 | 2871039 | 2978304 | 3.74% | 2513 | 7.89% | 0.4513 |
| SC | 46 | 2548140 | 2597442 | 1.93% | 2535 | 6.04% | 0.4997 |
| AL | 67 | 2264972 | 2285343 | 0.90% | 1962 | 6.76% | 0.3360 |
| OR | 36 | 2244493 | 2371007 | 5.64% | 4279 | 5.28% | 0.4553 |
| KY | 120 | 2074530 | 2051083 | 1.13% | 924 | 6.53% | 0.6011 |
| LA | 64 | 2006975 | 1952339 | 2.72% | 2351 | 10.60% | 0.2879 |
| IA | 99 | 1674011 | 1740455 | 3.97% | 1054 | 6.17% | 0.4188 |
| OK | 77 | 1566173 | 1592185 | 1.66% | 1213 | 6.38% | 0.5824 |
| UT | 28 | 1487944 | 1702580 | 14.43% | 8036 | 7.68% | 0.1830 |
| NV | 17 | 1484840 | 1563282 | 5.28% | 7052 | 9.09% | 0.5588 |
| KS | 105 | 1327591 | 1398320 | 5.33% | 896 | 7.68% | 0.5295 |
| MS | 82 | 1228008 | 1194542 | 2.73% | 1126 | 9.52% | 0.3399 |
| AR | 72 | 1165888 | 1161816 | 0.35% | 1214 | 8.33% | 0.5332 |
| NE | 93 | 947159 | 1007597 | 6.38% | 762 | 6.95% | 0.4999 |
| ID | 44 | 905057 | 969188 | 7.09% | 1820 | 9.58% | 0.3677 |
| NH | 10 | 826189 | 889399 | 7.65% | 6615 | 5.44% | 0.5076 |
| ME | 16 | 824806 | 863134 | 4.65% | 2531 | 4.34% | 0.6715 |
| NM | 16 | 809679 | 840895 | 3.86% | 2959 | 8.68% | 0.6120 |
| WV | 55 | 762390 | 726293 | 4.73% | 932 | 7.71% | 0.5818 |
| MT | 55 | 602163 | 635947 | 5.61% | 908 | 9.05% | 0.5143 |
| HI | 4 | 516701 | 534780 | 3.50% | 9135 | 6.30% | 0.5872 |
| RI | 5 | 511816 | 510252 | 0.31% | 5562 | 6.35% | 0.3655 |
| DE | 3 | 511697 | 545006 | 6.51% | 15127 | 7.39% | 0.4884 |
| SD | 65 | 425860 | 433859 | 1.88% | 333 | 7.65% | 0.6312 |
| VT | 14 | 372885 | 381514 | 2.31% | 838 | 3.68% | 0.6766 |
| ND | 52 | 367508 | 370178 | 0.73% | 425 | 6.98% | 0.4408 |
| DC | 1 | 325879 | 288559 | 11.45% | 37320 | 11.45% | NA |
| WY | 23 | 269048 | 289065 | 7.44% | 1191 | 6.72% | 0.5228 |
State-Level Prediction Performance (2024)